Machine Learning Analysis Pipeline
EDR: Dataset Loading & Preprocessing
EDR – Train/Test Overview
• Train shape: (185442, 20) | Test shape: (16287, 20)
• Total train samples: 185,442 | Total test samples: 16,287
• Number of features: 16
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 184,585
• 1: 857
• Class balance (minority/majority): 0.4643%
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
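The preparation steps above (infinite values to NaN, median imputation from train statistics, train-fitted standardization) can be sketched with NumPy alone. The report presumably uses sklearn's `StandardScaler`; this is an illustrative equivalent, not the pipeline's actual code:

```python
import numpy as np

def preprocess(train, test):
    """Replace +/-inf with NaN, fill NaNs with per-feature train
    medians, then standardize with train statistics only (what a
    train-fitted StandardScaler does)."""
    train = np.where(np.isinf(train), np.nan, train.astype(float))
    test = np.where(np.isinf(test), np.nan, test.astype(float))
    medians = np.nanmedian(train, axis=0)           # fit on train only
    train = np.where(np.isnan(train), medians, train)
    test = np.where(np.isnan(test), medians, test)  # no test leakage
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    sigma[sigma == 0] = 1.0                         # guard constant columns
    return (train - mu) / sigma, (test - mu) / sigma

X_train = np.array([[1.0, 2.0], [np.inf, 4.0], [3.0, np.nan]])
X_test = np.array([[2.0, np.nan]])
Xtr, Xte = preprocess(X_train, X_test)
print(Xtr.mean(axis=0))  # columns centered at ~0 after scaling
```

Fitting the medians and scaling statistics on the train split only is what keeps test-set information out of the model, as the bullet above emphasizes.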
⚠️ Extreme Class Imbalance Detected
• Minority class represents only 0.4643% of the data
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
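As a minimal illustration of the rebalancing idea, here is plain random oversampling of the minority class to a chosen minority/majority ratio. SMOTE proper (from the imbalanced-learn package) synthesizes interpolated minority samples rather than duplicating rows, but the ratio mechanics are the same:

```python
import numpy as np

def oversample_minority(X, y, target_ratio=0.10, seed=0):
    """Duplicate minority rows at random until the minority/majority
    ratio reaches target_ratio. A sketch only: SMOTE would create
    synthetic interpolated points instead of exact copies."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_needed = int(target_ratio * len(majority)) - len(minority)
    if n_needed <= 0:
        return X, y
    extra = rng.choice(minority, size=n_needed, replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

# toy data mimicking a severe imbalance (5 positives in 1,000 rows)
X = np.arange(2000).reshape(1000, 2)
y = np.zeros(1000, dtype=int)
y[:5] = 1
Xr, yr = oversample_minority(X, y, target_ratio=0.10)
print(yr.mean())  # minority fraction after resampling
```

Cost-sensitive learning reaches a similar effect without resampling, e.g. via per-class weights (`class_weight` in most sklearn estimators).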
Baseline (Most-Frequent) Accuracy: 0.9955
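This baseline is simply the majority-class share of the test set (sklearn's `DummyClassifier(strategy="most_frequent")` reports the same number), which is why accuracy alone is uninformative here:

```python
# Most-frequent baseline: predict class 0 for every sample, so
# accuracy equals the negative share of the test set.
# Using the report's test counts: 16,213 negatives of 16,287 total.
negatives, total = 16213, 16287
baseline_acc = negatives / total
print(round(baseline_acc, 4))  # 0.9955
```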
EDR: Model Performance Comparison
EDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9410 | 0.6610 | 0.0297 | 0.3784 | 0.0551 | 0.6427 | 0.0262 |
| Random Forest (SMOTE) | 0.8831 | 0.5983 | 0.0123 | 0.3108 | 0.0236 | 0.7979 | 0.0263 |
| LightGBM | 0.7034 | 0.6223 | 0.0083 | 0.5405 | 0.0163 | 0.6637 | 0.0071 |
| Balanced RF | 0.8914 | 0.6831 | 0.0198 | 0.4730 | 0.0381 | 0.8597 | 0.0581 |
| SGD SVM | 0.9322 | 0.6296 | 0.0222 | 0.3243 | 0.0416 | n/a | n/a |
| IsolationForest | 0.9916 | 0.5183 | 0.0441 | 0.0405 | 0.0423 | n/a | n/a |
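Every metric in this table can be recovered from the confusion-matrix counts reported in the next section. As a cross-check, here are the Logistic Regression numbers derived from TN=15298, FP=915, FN=46, TP=28:

```python
# Headline metrics from raw confusion-matrix counts
# (Logistic Regression row: TN=15298, FP=915, FN=46, TP=28).
tn, fp, fn, tp = 15298, 915, 46, 28

recall = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)             # true negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
balanced_acc = (recall + specificity) / 2
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(accuracy, 4), round(balanced_acc, 4),
      round(precision, 4), round(recall, 4), round(f1, 4))
```

The derived values match the table row (0.9410, 0.6610, 0.0297, 0.3784, 0.0551); ROC-AUC and PR-AUC additionally need the model's scores, not just the thresholded predictions.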
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 15298 | 915 | 46 | 28 | 5.64% | 62.16% |
| Random Forest (SMOTE) | 14360 | 1853 | 51 | 23 | 11.43% | 68.92% |
| LightGBM | 11416 | 4797 | 34 | 40 | 29.59% | 45.95% |
| Balanced RF | 14483 | 1730 | 39 | 35 | 10.67% | 52.70% |
| SGD SVM | 15158 | 1055 | 50 | 24 | 6.51% | 67.57% |
| IsolationForest | 16148 | 65 | 71 | 3 | 0.40% | 95.95% |
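The two rate columns follow directly from the counts: FP Rate = FP / (FP + TN) and Miss Rate = FN / (FN + TP). A quick recomputation for two rows of the table:

```python
# FP Rate = FP / (FP + TN); Miss Rate = FN / (FN + TP)
rows = {
    "Logistic Regression": (15298, 915, 46, 28),   # TN, FP, FN, TP
    "IsolationForest": (16148, 65, 71, 3),
}
for name, (tn, fp, fn, tp) in rows.items():
    fpr = fp / (fp + tn)
    miss = fn / (fn + tp)
    print(f"{name}: FP rate {fpr:.2%}, miss rate {miss:.2%}")
```

This reproduces 5.64% / 62.16% for Logistic Regression and 0.40% / 95.95% for IsolationForest.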
Best Models by Metric

| Metric | Best Model | Value |
|---|---|---|
| Accuracy | IsolationForest | 0.9916 |
| Balanced Acc | Balanced RF | 0.6831 |
| Precision | IsolationForest | 0.0441 |
| Recall | LightGBM | 0.5405 |
| F1 | Logistic Regression | 0.0551 |
| ROC-AUC | Balanced RF | 0.8597 |
| PR-AUC | Balanced RF | 0.0581 |
| Lowest False Positive Rate | IsolationForest | 0.40% |
| Lowest Miss Rate | LightGBM | 45.95% |
EDR – Metrics by Model
EDR – ROC Curves
EDR – Precision–Recall Curves
EDR – Predicted Probability Distributions
EDR – Threshold Sweep
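A threshold sweep re-binarizes the predicted scores at a grid of cutoffs and tracks how precision, recall, and F1 trade off, which is what the sweep plot visualizes. A self-contained sketch on toy scores (not the report's models):

```python
import numpy as np

def threshold_sweep(scores, y, thresholds):
    """Precision/recall/F1 at each decision threshold over
    predicted scores; returns a list of (t, prec, rec, f1)."""
    out = []
    for t in thresholds:
        pred = (scores >= t).astype(int)
        tp = int(((pred == 1) & (y == 1)).sum())
        fp = int(((pred == 1) & (y == 0)).sum())
        fn = int(((pred == 0) & (y == 1)).sum())
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out.append((t, prec, rec, f1))
    return out

# toy scores: positives tend to score higher
y = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.6, 0.5, 0.9])
for t, p, r, f in threshold_sweep(scores, y, [0.3, 0.5, 0.7]):
    print(f"t={t}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With imbalance this severe, lowering the threshold below 0.5 is a cheap way to trade false positives for a lower miss rate.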
EDR: Logistic Regression – Detailed Analysis
EDR – Logistic Regression: Confusion Matrix
EDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9970 | 0.9436 | 0.9695 | 16,213 |
| 1 | 0.0297 | 0.3784 | 0.0551 | 74 |
| Accuracy | | | 0.9410 | 16,287 |
EDR – Logistic Regression: Feature Importance
EDR: Random Forest (SMOTE) – Detailed Analysis
EDR – Random Forest (SMOTE): Confusion Matrix
EDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9965 | 0.8857 | 0.9378 | 16,213 |
| 1 | 0.0123 | 0.3108 | 0.0236 | 74 |
| Accuracy | | | 0.8831 | 16,287 |
EDR – Random Forest (SMOTE): Feature Importance
EDR: LightGBM – Detailed Analysis
EDR – LightGBM: Confusion Matrix
EDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9970 | 0.7041 | 0.8254 | 16,213 |
| 1 | 0.0083 | 0.5405 | 0.0163 | 74 |
| Accuracy | | | 0.7034 | 16,287 |
EDR – LightGBM: Feature Importance
EDR: Balanced RF – Detailed Analysis
EDR – Balanced RF: Confusion Matrix
EDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9973 | 0.8933 | 0.9424 | 16,213 |
| 1 | 0.0198 | 0.4730 | 0.0381 | 74 |
| Accuracy | | | 0.8914 | 16,287 |
EDR – Balanced RF: Feature Importance
EDR: SGD SVM – Detailed Analysis
EDR – SGD SVM: Confusion Matrix
EDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9967 | 0.9349 | 0.9648 | 16,213 |
| 1 | 0.0222 | 0.3243 | 0.0416 | 74 |
| Accuracy | | | 0.9322 | 16,287 |
EDR – SGD SVM: Feature Importance
EDR: IsolationForest – Detailed Analysis
EDR – IsolationForest: Confusion Matrix
EDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9956 | 0.9960 | 0.9958 | 16,213 |
| 1 | 0.0441 | 0.0405 | 0.0423 | 74 |
| Accuracy | | | 0.9916 | 16,287 |
EDR – IsolationForest: Feature Importance
Feature importance not available for this model type.
XDR: Dataset Loading & Preprocessing
XDR – Train/Test Overview
• Train shape: (185442, 34) | Test shape: (16287, 34)
• Total train samples: 185,442 | Total test samples: 16,287
• Number of features: 30
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 184,585
• 1: 857
• Class balance (minority/majority): 0.4643%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• Minority class represents only 0.4643% of the data
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9955
XDR: Model Performance Comparison
XDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9406 | 0.6002 | 0.0204 | 0.2568 | 0.0378 | 0.6567 | 0.0233 |
| Random Forest (SMOTE) | 0.9137 | 0.5800 | 0.0132 | 0.2432 | 0.0250 | 0.8037 | 0.0499 |
| LightGBM | 0.8390 | 0.6299 | 0.0119 | 0.4189 | 0.0231 | 0.7564 | 0.0111 |
| Balanced RF | 0.8951 | 0.6984 | 0.0217 | 0.5000 | 0.0415 | 0.8584 | 0.0671 |
| SGD SVM | 0.8709 | 0.6123 | 0.0125 | 0.3514 | 0.0241 | n/a | n/a |
| IsolationForest | 0.9944 | 0.5129 | 0.0909 | 0.0270 | 0.0417 | n/a | n/a |
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 15300 | 913 | 55 | 19 | 5.63% | 74.32% |
| Random Forest (SMOTE) | 14864 | 1349 | 56 | 18 | 8.32% | 75.68% |
| LightGBM | 13633 | 2580 | 43 | 31 | 15.91% | 58.11% |
| Balanced RF | 14541 | 1672 | 37 | 37 | 10.31% | 50.00% |
| SGD SVM | 14158 | 2055 | 48 | 26 | 12.68% | 64.86% |
| IsolationForest | 16193 | 20 | 72 | 2 | 0.12% | 97.30% |
Best Models by Metric

| Metric | Best Model | Value |
|---|---|---|
| Accuracy | IsolationForest | 0.9944 |
| Balanced Acc | Balanced RF | 0.6984 |
| Precision | IsolationForest | 0.0909 |
| Recall | Balanced RF | 0.5000 |
| F1 | IsolationForest | 0.0417 |
| ROC-AUC | Balanced RF | 0.8584 |
| PR-AUC | Balanced RF | 0.0671 |
| Lowest False Positive Rate | IsolationForest | 0.12% |
| Lowest Miss Rate | Balanced RF | 50.00% |
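Given a results table like the ones above, the per-metric winners can be picked mechanically. A sketch using a few of the XDR values (higher is better for all three metrics shown):

```python
# Pick the best model per metric from a nested results dict
# (values taken from the XDR model-performance table above).
results = {
    "Logistic Regression": {"Balanced Acc": 0.6002, "ROC-AUC": 0.6567, "PR-AUC": 0.0233},
    "Balanced RF":         {"Balanced Acc": 0.6984, "ROC-AUC": 0.8584, "PR-AUC": 0.0671},
    "LightGBM":            {"Balanced Acc": 0.6299, "ROC-AUC": 0.7564, "PR-AUC": 0.0111},
}
for metric in ["Balanced Acc", "ROC-AUC", "PR-AUC"]:
    best = max(results, key=lambda m: results[m][metric])
    print(f"{metric}: {best} ({results[best][metric]:.4f})")
```

For rate metrics such as FP Rate or Miss Rate the comparison flips to `min`, since lower is better.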
XDR – Metrics by Model
XDR – ROC Curves
XDR – Precision–Recall Curves
XDR – Predicted Probability Distributions
XDR – Threshold Sweep
XDR: Logistic Regression – Detailed Analysis
XDR – Logistic Regression: Confusion Matrix
XDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9964 | 0.9437 | 0.9693 | 16,213 |
| 1 | 0.0204 | 0.2568 | 0.0378 | 74 |
| Accuracy | | | 0.9406 | 16,287 |
XDR – Logistic Regression: Feature Importance
XDR: Random Forest (SMOTE) – Detailed Analysis
XDR – Random Forest (SMOTE): Confusion Matrix
XDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9962 | 0.9168 | 0.9549 | 16,213 |
| 1 | 0.0132 | 0.2432 | 0.0250 | 74 |
| Accuracy | | | 0.9137 | 16,287 |
XDR – Random Forest (SMOTE): Feature Importance
XDR: LightGBM – Detailed Analysis
XDR – LightGBM: Confusion Matrix
XDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9969 | 0.8409 | 0.9122 | 16,213 |
| 1 | 0.0119 | 0.4189 | 0.0231 | 74 |
| Accuracy | | | 0.8390 | 16,287 |
XDR – LightGBM: Feature Importance
XDR: Balanced RF – Detailed Analysis
XDR – Balanced RF: Confusion Matrix
XDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9975 | 0.8969 | 0.9445 | 16,213 |
| 1 | 0.0217 | 0.5000 | 0.0415 | 74 |
| Accuracy | | | 0.8951 | 16,287 |
XDR – Balanced RF: Feature Importance
XDR: SGD SVM – Detailed Analysis
XDR – SGD SVM: Confusion Matrix
XDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9966 | 0.8732 | 0.9309 | 16,213 |
| 1 | 0.0125 | 0.3514 | 0.0241 | 74 |
| Accuracy | | | 0.8709 | 16,287 |
XDR – SGD SVM: Feature Importance
XDR: IsolationForest – Detailed Analysis
XDR – IsolationForest: Confusion Matrix
XDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9956 | 0.9988 | 0.9972 | 16,213 |
| 1 | 0.0909 | 0.0270 | 0.0417 | 74 |
| Accuracy | | | 0.9944 | 16,287 |
XDR – IsolationForest: Feature Importance
Feature importance not available for this model type.